You may notice that some people always wear a happy face while others stay upset. Why? To explore what makes people happy, it is interesting to investigate the happy moments of people from different countries, genders, ages, marital statuses and so on. The analysis below digs into a collection of such happy moments.

Step 0: read data and load libraries

Step 1: Process data and give some basic descriptive statistics

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(text, wordcount)) %>%
  mutate(word.count = sapply(original_hm, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))
head(hm_data,5)
## # A tibble: 5 x 12
##     wid original_hm gender marital parenthood reflection_peri~ age  
##   <int> <chr>       <chr>  <chr>   <chr>      <fct>            <chr>
## 1  2053 I went on ~ m      single  n          hours_24         35   
## 2     2 I was happ~ m      married y          hours_24         29.0 
## 3  1936 I went to ~ f      married y          hours_24         30   
## 4   206 We had a s~ f      married n          hours_24         28   
## 5    45 I meditate~ m      single  n          hours_24         23   
## # ... with 5 more variables: country <chr>, ground_truth_category <chr>,
## #   text <chr>, count <int>, word.count <int>
suppressWarnings(describe.by(hm_data))
##                        vars     n    mean      sd median trimmed     mad
## wid                       1 94574 2680.07 3487.41   1097 1925.46 1306.17
## original_hm*              2 94574     NaN      NA     NA     NaN      NA
## gender*                   3 94574     NaN      NA     NA     NaN      NA
## marital*                  4 94574     NaN      NA     NA     NaN      NA
## parenthood*               5 94574     NaN      NA     NA     NaN      NA
## reflection_period*        6 94574    1.50    0.50      2    1.50    0.00
## age*                      7 94541   31.96   10.72     30   30.50    7.41
## country*                  8 94427     NaN      NA     NA     NaN      NA
## ground_truth_category*    9 13379     NaN      NA     NA     NaN      NA
## text*                    10 94574     NaN      NA     NA     NaN      NA
## count                    11 94574    6.16    7.65      5    5.09    2.97
## word.count               12 94574   18.31   21.63     14   14.97    8.90
##                        min   max range  skew kurtosis    se
## wid                      1 13839 13838  1.68     1.75 11.34
## original_hm*           Inf  -Inf  -Inf    NA       NA    NA
## gender*                Inf  -Inf  -Inf    NA       NA    NA
## marital*               Inf  -Inf  -Inf    NA       NA    NA
## parenthood*            Inf  -Inf  -Inf    NA       NA    NA
## reflection_period*       1     2     1 -0.02    -2.00  0.00
## age*                     2   233   231  5.14    77.75  0.03
## country*               Inf  -Inf  -Inf    NA       NA    NA
## ground_truth_category* Inf  -Inf  -Inf    NA       NA    NA
## text*                  Inf  -Inf  -Inf    NA       NA    NA
## count                    1   509   508 20.56   907.01  0.02
## word.count               2  1155  1153 14.84   493.48  0.07

The data looks good: it has nearly 100 thousand observations (94,574 rows) and 12 variables. The original text “original_hm” has been cleaned, and the key words have been extracted into “text”.

Let’s start with sentence length. How many words do people use to describe their happy moments?

sentence.length <- hm_data %>% 
  group_by(gender) %>% 
  count(word.count, sort = TRUE) %>% 
  left_join(hm_data %>% 
              group_by(gender) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "gender"
head(sentence.length,5)
## # A tibble: 5 x 5
## # Groups:   gender [1]
##   gender word.count     n total   freq
##   <chr>       <int> <int> <int>  <dbl>
## 1 m               8  3430 55840 0.0614
## 2 m              10  3286 55840 0.0588
## 3 m              11  3258 55840 0.0583
## 4 m               9  3256 55840 0.0583
## 5 m               7  3076 55840 0.0551

It seems people usually use 8 to 13 words to record their happy moments. Females tend to write a bit more about their happy moments; perhaps they have more feelings to express.

ggplot(sentence.length, aes(x = word.count, y= freq, group = factor(1))) +
  geom_bar(stat = "identity", color = "cornflowerblue")+
  ylim(0,0.07) + 
  xlim(0,100) +
  theme_bw() +
  facet_wrap(~ gender, ncol = 2) +
  labs(title = "Sentence Length for Male and Female", x = "Word Numbers", y = "Frequency")
## Warning: Removed 246 rows containing missing values (position_stack).

head(sentence.length$word.count[sentence.length$gender == "f"],5)
## [1] 11 12 10  9 13
head(sentence.length$word.count[sentence.length$gender == "m"],5)
## [1]  8 10 11  9  7

Step 2: Create a bag of words using the text data and analyze the word frequency in different categories

data(stop_words)
bag_of_words <-  hm_data %>%
  unnest_tokens(word, text) 
head(bag_of_words,5)
## # A tibble: 5 x 12
##     wid original_hm gender marital parenthood reflection_peri~ age  
##   <int> <chr>       <chr>  <chr>   <chr>      <fct>            <chr>
## 1  2053 I went on ~ m      single  n          hours_24         35   
## 2  2053 I went on ~ m      single  n          hours_24         35   
## 3  2053 I went on ~ m      single  n          hours_24         35   
## 4  2053 I went on ~ m      single  n          hours_24         35   
## 5     2 I was happ~ m      married y          hours_24         29.0 
## # ... with 5 more variables: country <chr>, ground_truth_category <chr>,
## #   count <int>, word.count <int>, word <chr>

Word cloud

Overall, “time”, “friend” and “day” have the highest frequency. After those, “home”, “family”, “game” and others make people feel happy.

frequency_word <- bag_of_words %>% 
  count(word)
word_cloud <- data.frame(word = frequency_word$word, freq = frequency_word$n)
set.seed(1)
suppressWarnings(wordcloud(words = word_cloud$word, freq = word_cloud$freq, min.freq = 1,max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2")))

Gender Analysis

We continue by discussing the differences between the happy moments of males and females.

frequency_gender <- bag_of_words %>% 
  group_by(gender) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(gender) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "gender"
head(frequency_gender,5)
## # A tibble: 5 x 5
## # Groups:   gender [2]
##   gender word       n  total   freq
##   <chr>  <chr>  <int>  <int>  <dbl>
## 1 m      friend  6366 340130 0.0187
## 2 m      day     5267 340130 0.0155
## 3 m      time    5142 340130 0.0151
## 4 f      day     4103 242776 0.0169
## 5 f      time    4055 242776 0.0167
frequency_gender <- frequency_gender %>% 
  select(gender, word, freq) %>% 
  spread(gender, freq) %>%
  arrange(m, f)

frequency_gender
## # A tibble: 18,812 x 3
##    word                   f          m
##    <chr>              <dbl>      <dbl>
##  1 abdomen       0.00000412 0.00000294
##  2 abdominal     0.00000412 0.00000294
##  3 abnormalities 0.00000412 0.00000294
##  4 aca           0.00000412 0.00000294
##  5 achy          0.00000412 0.00000294
##  6 adidas        0.00000412 0.00000294
##  7 administered  0.00000412 0.00000294
##  8 afternoonget  0.00000412 0.00000294
##  9 aggregate     0.00000412 0.00000294
## 10 alexis        0.00000412 0.00000294
## # ... with 18,802 more rows
ggplot(frequency_gender, aes(f, m)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "black") + 
  theme_bw() +
  labs(title = "Word Frequencies for Male and Female", x = "Female", y = "Male")
## Warning: Removed 11068 rows containing missing values (geom_point).
## Warning: Removed 11068 rows containing missing values (geom_text).

The figure above shows that both genders draw happiness from victory, with little difference: everyone pursues winning something, no doubt a popular value at present. What’s more, both males and females frequently mention their partners, and the words “husband” and “boyfriend” delight females significantly. As for differences, females draw happiness from the people around them, such as sisters, dads and kids; “jewelry” and “kindergarten” also appear more often for them. For males, “beer”, “mobile” and “hill” stand out.

Country Analysis

Then, we move on to different countries, choosing countries with large populations and areas and with different cultures.

frequency_country <- bag_of_words %>% 
  group_by(country) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(country) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "country"
head(frequency_country,5)
## # A tibble: 5 x 5
## # Groups:   country [2]
##   country word        n  total    freq
##   <chr>   <chr>   <int>  <int>   <dbl>
## 1 USA     friend   6917 417407 0.0166 
## 2 USA     time     6191 417407 0.0148 
## 3 USA     day      5812 417407 0.0139 
## 4 USA     watched  3389 417407 0.00812
## 5 IND     day      3167 138456 0.0229
ggplot(frequency_country, aes(x=country, y=freq)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  theme_bw() +
  scale_y_log10(labels = percent_format()) +
  scale_x_discrete(limits=c("USA", "IND", "CAN", "JPN", "MEX", "PAK", "ITA", "GBR", "IDN", "DEU","FRA")) +
  labs(title = "Word Frequencies for Different Countries", x = "Country", y = "Frequency")
## Warning: Removed 11413 rows containing missing values (geom_point).
## Warning: Removed 11413 rows containing missing values (geom_text).

The figure above shows that people in America, Canada, Japan, Indonesia and France really enjoy life in the company of friends. The English desire success but also love their afternoon tea. Italians balance two amazing things, cars and beer: a wonderful life in their minds is driving a car to go drink some beer. Indians and Pakistanis cherish their family members more, and a big house features in the Pakistani dream of a perfect life.

Marriage Analysis

frequency_marital <- bag_of_words %>% 
  group_by(marital) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(marital) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "marital"
head(frequency_marital,5)
## # A tibble: 5 x 5
## # Groups:   marital [2]
##   marital word       n  total   freq
##   <chr>   <chr>  <int>  <int>  <dbl>
## 1 single  friend  6871 315218 0.0218
## 2 single  day     5055 315218 0.0160
## 3 single  time    4918 315218 0.0156
## 4 married day     4315 267688 0.0161
## 5 married time    4279 267688 0.0160
frequency_marital <- frequency_marital %>% 
  select(marital, word, freq) %>% 
  spread(marital, freq) %>%
  arrange(single, married)

frequency_marital
## # A tibble: 18,812 x 3
##    word             married     single
##    <chr>              <dbl>      <dbl>
##  1 abdominal     0.00000374 0.00000317
##  2 abig          0.00000374 0.00000317
##  3 abnormalities 0.00000374 0.00000317
##  4 aca           0.00000374 0.00000317
##  5 accent        0.00000374 0.00000317
##  6 acclimated    0.00000374 0.00000317
##  7 achy          0.00000374 0.00000317
##  8 acute         0.00000374 0.00000317
##  9 aerospace     0.00000374 0.00000317
## 10 afternoonget  0.00000374 0.00000317
## # ... with 18,802 more rows
ggplot(frequency_marital, aes(single, married)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "black") + 
  theme_bw() +
  labs(title = "Word Frequencies for Single and Married People", x = "Single", y = "Married")
## Warning: Removed 11024 rows containing missing values (geom_point).
## Warning: Removed 11024 rows containing missing values (geom_text).

“Home” and “house” make people feel warm and happy, married or not. Single people enjoy their “time” with a girlfriend or boyfriend, a roommate, a fiancee, or even a cat, while married people share more happy moments with their children. This likely results from their having become parents, so the next analysis is about parenthood.

Parenthood Analysis

frequency_parenthood <- bag_of_words %>% 
  group_by(parenthood) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(parenthood) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "parenthood"
head(frequency_parenthood,5)
## # A tibble: 5 x 5
## # Groups:   parenthood [2]
##   parenthood word       n  total   freq
##   <chr>      <chr>  <int>  <int>  <dbl>
## 1 n          friend  7312 345196 0.0212
## 2 n          day     5457 345196 0.0158
## 3 n          time    5399 345196 0.0156
## 4 y          day     3913 237710 0.0165
## 5 y          time    3798 237710 0.0160
frequency_parenthood <- frequency_parenthood %>% 
  select(parenthood, word, freq) %>% 
  spread(parenthood, freq) %>%
  arrange(y, n)

frequency_parenthood
## # A tibble: 18,812 x 3
##    word                  n          y
##    <chr>             <dbl>      <dbl>
##  1 abdomen      0.00000290 0.00000421
##  2 abdominal    0.00000290 0.00000421
##  3 aca          0.00000290 0.00000421
##  4 accent       0.00000290 0.00000421
##  5 acclimated   0.00000290 0.00000421
##  6 achy         0.00000290 0.00000421
##  7 acute        0.00000290 0.00000421
##  8 administered 0.00000290 0.00000421
##  9 aerospace    0.00000290 0.00000421
## 10 afternoonget 0.00000290 0.00000421
## # ... with 18,802 more rows
ggplot(frequency_parenthood, aes(y,n)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "black") + 
  theme_bw() +
  labs(title = "Word Frequencies for Parenthood or Not", x = "Y", y = "N")
## Warning: Removed 11230 rows containing missing values (geom_point).
## Warning: Removed 11230 rows containing missing values (geom_text).

In fact, parents pay more attention to their kids or grandkids, while people without children mention their midterms more. This is reasonable, since students worried about exams are unlikely to have kids yet. Parent or not, people need tasty food, a satisfying job, amazing parties and a good car to feel good.
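The parent vs. non-parent contrast can also be quantified with a log ratio of the two frequency columns (a rough sketch, assuming the spread `frequency_parenthood` from above is in scope):

```r
# Sketch: words most over-represented among parents, measured by the
# log2 ratio of parent (y) to non-parent (n) frequency
frequency_parenthood %>%
  filter(!is.na(y), !is.na(n)) %>%
  mutate(log_ratio = log2(y / n)) %>%
  arrange(desc(log_ratio)) %>%
  head(10)
```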

Reflection Period Analysis

But what stays in your memory longer?

frequency_reflection_period <- bag_of_words %>% 
  group_by(reflection_period) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(reflection_period) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "reflection_period"
head(frequency_reflection_period,5)
## # A tibble: 5 x 5
## # Groups:   reflection_period [2]
##   reflection_period word       n  total   freq
##   <fct>             <chr>  <int>  <int>  <dbl>
## 1 months_3          friend  5552 302221 0.0184
## 2 months_3          day     5281 302221 0.0175
## 3 months_3          time    4883 302221 0.0162
## 4 hours_24          friend  4775 280685 0.0170
## 5 hours_24          time    4314 280685 0.0154
frequency_reflection_period <- frequency_reflection_period %>% 
  select(reflection_period, word, freq) %>% 
  spread(reflection_period, freq) %>%
  arrange(hours_24, months_3)

frequency_reflection_period
## # A tibble: 18,812 x 3
##    word           hours_24   months_3
##    <chr>             <dbl>      <dbl>
##  1 abdomen      0.00000356 0.00000331
##  2 abdominal    0.00000356 0.00000331
##  3 aca          0.00000356 0.00000331
##  4 achy         0.00000356 0.00000331
##  5 adidas       0.00000356 0.00000331
##  6 afterall     0.00000356 0.00000331
##  7 afternoonget 0.00000356 0.00000331
##  8 agreeable    0.00000356 0.00000331
##  9 ahi          0.00000356 0.00000331
## 10 aircondition 0.00000356 0.00000331
## # ... with 18,802 more rows
ggplot(frequency_reflection_period, aes(hours_24, months_3)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "black") + 
  theme_bw() +
  labs(title = "Word Frequencies for Different Reflection Periods", x = "24 Hours", y = "3 Months")
## Warning: Removed 10747 rows containing missing values (geom_point).
## Warning: Removed 10747 rows containing missing values (geom_text).

It seems a good morning, a cute dog, or even an interesting video can make your day. But as time passes, the important dates, such as Valentine’s Day, graduation day and birthdays, impress you the most over three months.

Age Analysis

Will things look different at various ages?

frequency_age <- bag_of_words %>% 
  group_by(age) %>% 
  count(word, sort = TRUE) %>% 
  left_join(bag_of_words %>% 
              group_by(age) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "age"
# Keep only rows whose age is purely numeric (this also drops entries
# such as "prefer not to say"), then convert age to integer
frequency_age <- frequency_age[grepl("^[0-9]+$", frequency_age$age),]
frequency_age$age <- as.integer(frequency_age$age)
frequency_age <- frequency_age[frequency_age$age <= 95,]
max(frequency_age$age)
## [1] 95
nrow(frequency_age)
## [1] 147003
ggplot(frequency_age, aes(x = age, y = freq)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_continuous(limits = c(0,100)) +
  theme_bw() +
  scale_y_log10(labels = percent_format()) +
  labs(title = "Word Frequencies for Different Ages", x = "Age", y = "Frequency")

When you are a child, you may feel good because your birthday gift is a new bike. As you grow up, friends support you and you start to pursue success in your career. Getting older makes you want to spend more time with family, and around age 80, planting becomes attractive.

frequency_age$age_range <- floor(frequency_age$age/10)*10 

frequency_age %>%
  group_by(age_range, word) %>%
  summarise(freq = sum(freq)) %>%
   left_join(frequency_age %>% 
              group_by(age_range, word))
## Joining, by = c("age_range", "word", "freq")
## # A tibble: 40,510 x 6
## # Groups:   age_range [?]
##    age_range word            freq   age     n total
##        <dbl> <chr>          <dbl> <int> <int> <int>
##  1         0 actress      0.0217      2     1    46
##  2         0 aircondition 0.00461     3     1   217
##  3         0 appreciated  0.0217      2     1    46
##  4         0 backyard     0.0217      2     1    46
##  5         0 bag          0.00461     3     1   217
##  6         0 bar          0.00461     3     1   217
##  7         0 beach        0.00461     3     1   217
##  8         0 beer         0.0217      2     1    46
##  9         0 bike         0.00922     3     2   217
## 10         0 biriyani     0.00461     3     1   217
## # ... with 40,500 more rows
frequency_age %>%
  group_by(age_range) %>%
  arrange(desc(freq)) %>%
  slice(1:5) %>%
  ggplot(aes(word, freq, fill = freq)) +
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ age_range, scales = "free") +
  ylab("Frequency") +
  coord_flip()

Investigating deeper, some details stand out. Students care about passing their tests and receiving an offer of admission from their dream school, and a granddaughter can be really sweet to grandparents in their sixties.

Step 3: Sentiment analysis and correlation test

Sentiment Analysis

First, get the overall average sentiment score for each ground truth category. Unsurprisingly, the happy moments consist almost entirely of positive elements in every category.

words_by_category <- bag_of_words %>%
  count(ground_truth_category, word, sort = TRUE) %>%
  ungroup()
words_by_category
## # A tibble: 31,191 x 3
##    ground_truth_category word        n
##    <chr>                 <chr>   <int>
##  1 <NA>                  friend   8679
##  2 <NA>                  day      7988
##  3 <NA>                  time     7795
##  4 <NA>                  family   3779
##  5 <NA>                  watched  3534
##  6 <NA>                  home     3418
##  7 <NA>                  played   3303
##  8 <NA>                  feel     3259
##  9 <NA>                  finally  3151
## 10 <NA>                  found    3031
## # ... with 31,181 more rows
word_sentiments <- words_by_category %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(ground_truth_category) %>%
  summarize(score = sum(score * n) / sum(n))

word_sentiments %>%
  mutate(ground_truth_category = reorder(ground_truth_category, score)) %>%
  ggplot(aes(ground_truth_category, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score")

Then, sentiment analysis by word shows that the majority of top contributing words express positive feelings; only two words, “lost” and “bad”, are negative. Presumably the happy moment happened after the “lost” or “bad” moment.

contributions <- bag_of_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(word) %>%
  summarize(occurences = n(),
            contribution = sum(score))

contributions
## # A tibble: 820 x 3
##    word         occurences contribution
##    <chr>             <int>        <int>
##  1 abandoned            19          -38
##  2 ability              62          124
##  3 aboard               10           10
##  4 absentee              1           -1
##  5 absorbed              6            6
##  6 abusive              16          -48
##  7 accepted            562          562
##  8 accident            103         -206
##  9 accidentally         37          -74
## 10 accomplished        309          618
## # ... with 810 more rows
contributions %>%
  top_n(25, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

Now, try sentiment analysis by word within each ground truth category. Negative words appear in the achievement category, so the guess above may be right: first you lose, and then getting something back makes you happy.

top_sentiment_words <- words_by_category %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(contribution = score * n / sum(n))

top_sentiment_words <- na.omit(top_sentiment_words)
top_sentiment_words %>%
  group_by(ground_truth_category) %>%
  top_n(5, abs(contribution)) %>%
  ungroup() %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ ground_truth_category, scales = "free", ncol = 3) +
  coord_flip()

Correlation Test

Let’s try to find correlations between the different categories and the happy moments. First, find the tf-idf of each word within the ground truth categories.

tf_idf <- words_by_category %>%
  bind_tf_idf(word, ground_truth_category, n) %>%
  arrange(desc(tf_idf))
tf_idf
## # A tibble: 31,191 x 6
##    ground_truth_category word         n      tf   idf  tf_idf
##    <chr>                 <chr>    <int>   <dbl> <dbl>   <dbl>
##  1 exercise              exercise    32 0.0345  0.560 0.0193 
##  2 nature                blooming    11 0.00778 1.25  0.00975
##  3 exercise              gym         53 0.0571  0.154 0.00880
##  4 affection             daughter   419 0.0136  0.560 0.00761
##  5 leisure               movie      228 0.0441  0.154 0.00679
##  6 leisure               wend        28 0.00541 1.25  0.00678
##  7 exercise              yoga        10 0.0108  0.560 0.00603
##  8 exercise              workout     33 0.0356  0.154 0.00548
##  9 exercise              jog          6 0.00647 0.847 0.00548
## 10 exercise              weight       9 0.00970 0.560 0.00543
## # ... with 31,181 more rows

Here are the 10 terms with the highest tf-idf within each ground truth category.

tf_idf %>%
  group_by(ground_truth_category) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = ground_truth_category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ ground_truth_category, scales = "free") +
  ylab("tf-idf") +
  coord_flip()

category_cors <- words_by_category %>%
  pairwise_cor(ground_truth_category, word, n, sort = TRUE)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
category_cors <- na.omit(category_cors)
category_cors
## # A tibble: 42 x 3
##    item1            item2            correlation
##    <chr>            <chr>                  <dbl>
##  1 enjoy_the_moment affection              0.612
##  2 affection        enjoy_the_moment       0.612
##  3 enjoy_the_moment achievement            0.575
##  4 achievement      enjoy_the_moment       0.575
##  5 achievement      affection              0.463
##  6 affection        achievement            0.463
##  7 enjoy_the_moment leisure                0.434
##  8 leisure          enjoy_the_moment       0.434
##  9 nature           enjoy_the_moment       0.394
## 10 enjoy_the_moment nature                 0.394
## # ... with 32 more rows
category_cors %>%
  filter(correlation > .1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 6, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

It seems every category is related to every other. Maybe that is because they all use similar happy words to describe happy moments.
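One way to check this guess is to see how much the categories’ top words overlap (a rough sketch, assuming `words_by_category` is in scope):

```r
# Sketch: how many categories share each of their top-20 words; words
# shared by many categories would explain the uniformly high correlations
words_by_category %>%
  filter(!is.na(ground_truth_category)) %>%
  group_by(ground_truth_category) %>%
  top_n(20, n) %>%
  ungroup() %>%
  count(word, sort = TRUE) %>%
  head(10)
```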

Step 4: Create bigrams and find the high frequency phrases

usenet_bigrams <- hm_data %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

usenet_bigram_counts <- usenet_bigrams %>%
  count(ground_truth_category, bigram, sort = TRUE) %>%
  ungroup() %>%
  separate(bigram, c("word1", "word2"), sep = " ")

usenet_bigram_counts <- na.omit(usenet_bigram_counts[order(usenet_bigram_counts$n),])

# Pick top ten first word with highest frequency in the bigram data
head(sort(table(usenet_bigram_counts$word1), decreasing = T),10)
## 
##     day    time  friend   found finally  played watched    home     son 
##     566     510     501     321     315     289     268     264     262 
##    feel 
##     257

Select the top ten first word with highest frequency in the bigram data to find the phrases.

happy_words <- c("day", "time","friend","finally", "found", "watched", "played","daughter", "son", "home")

usenet_bigram_counts %>%
  filter(word1 %in% happy_words) %>%
  count(word1, word2, wt = n, sort = TRUE) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  mutate(contribution = score * nn) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ungroup() %>%
  mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free", nrow = 3) +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words preceded by a happy word") +
  ylab("Sentiment score * # of occurrences") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip()

When you finally finish something, you feel really good. You will be glad to hear good news from your kids. Three things can also make your day: staying at home, having fun with friends, and watching videos.

Step 5: Topic modeling and create some labels for people

What if we ignore the ground truth category and look for some new classifications?

# include only words that occur at least 50 times
topic_word <- bag_of_words %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup() %>%
  filter(word_total > 50)

# convert into a document-term matrix
# with document names such as achievement_2053
topic_dtm <- topic_word %>%
  unite(document, ground_truth_category, wid) %>%
  count(document, word) %>%
  cast_dtm(document, word, n)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
library(topicmodels)
topic_lda <- LDA(topic_dtm, k = 8, control = list(seed = 2018))
topic_lda %>%
  tidy() %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip()

Dividing the sample into eight types, we can create some labels for them. Type 1 could be someone who really enjoys family life and expects surprises. Type 2 could be a primary school student. Type 3 could be professional women in their twenties. Type 4 could be girls who like shopping and trips. Type 5 could be a college boy who is dating someone. Type 6 could be married guys with one or more kids. Type 7 could be a positive person who loves the world. Type 8 could be a quiet guy who has a dog or cat and loves reading.
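To actually attach these labels to respondents, each document’s most probable topic can be pulled from the LDA’s per-document (gamma) matrix (a sketch, assuming `topic_lda` from the fit above is in scope):

```r
# Sketch: assign each document (named category_wid) its most probable
# topic via the gamma matrix, assuming topic_lda is in scope
doc_topics <- topic_lda %>%
  tidy(matrix = "gamma") %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()
head(doc_topics)
```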

This interesting analysis of happy moments ends here.